library(readr)
library(dplyr)
library(ggplot2)
library(sf)
library(leaflet)
library(forcats)Access to green space
rm(list = ls())# custom function for plotting leaflet
source("../R/plot_map.R")The data
data <- readr::read_csv("../data/greenspace_data.csv") %>%
rename(
lad20nm = ladnm,
lad20cd = ladcd
)View(data)Looking at the data, each row is a different Lower layer Super Output Area (LSOA). Looking at the columns we have few variables. The most important to note are:
lsoa11nm/cd: This is the name/code of the Lower Layer Super Output Area
lad20nm/cd: This is the name/code of the Local Authority District (in London, these are also called ‘Boroughs’)
rgn11nm: This is the name of the Region
average_distance: This is the average distance to greenspace in each LSOA
Because we’re interested in London, let’s subset to London
london_data <- data %>%
filter(rgn11nm == "London")
View(london_data)Now we have a new data set which contains just the London LSOAs.
We might be interested to see what the average distance to green space is for a particular London borough. As an example, let’s talk about the borough we’re in, which is Westminster.
london_data %>%
filter(lad20nm == "Westminster") %>%
summarise(
average_distance = mean(average_distance)
) %>%
pull()[1] 305.0393
Can you find the average distance for all LSOAs in a different LAD? Here are the LADs
london_data %>%
distinct(lad20nm) %>%
pull() [1] "City of London" "Barking and Dagenham" "Barnet"
[4] "Bexley" "Brent" "Bromley"
[7] "Camden" "Croydon" "Ealing"
[10] "Enfield" "Greenwich" "Hackney"
[13] "Hammersmith and Fulham" "Haringey" "Harrow"
[16] "Havering" "Hillingdon" "Hounslow"
[19] "Islington" "Kensington and Chelsea" "Kingston upon Thames"
[22] "Lambeth" "Lewisham" "Merton"
[25] "Newham" "Redbridge" "Richmond upon Thames"
[28] "Southwark" "Sutton" "Tower Hamlets"
[31] "Waltham Forest" "Wandsworth" "Westminster"
Try it below with a LAD (or several) of your choice!
london_data %>%
filter(lad20nm == "Westminster") %>%
summarise(
average_distance = mean(average_distance)
) %>%
pull()[1] 305.0393
It would be long to do this individually for all LADs. We can do some neat coding to calculate the average for each LAD
london_data %>%
group_by(lad20nm) %>%
summarise(
average_distance = mean(average_distance)
)# A tibble: 33 × 2
lad20nm average_distance
<chr> <dbl>
1 Barking and Dagenham 285.
2 Barnet 300.
3 Bexley 341.
4 Brent 318.
5 Bromley 290.
6 Camden 308.
7 City of London 227.
8 Croydon 311.
9 Ealing 281.
10 Enfield 296.
# ℹ 23 more rows
We can arrange the result so that the LADs with the shortest average distance appear at the top. You do this using the arrange() function, with the variable you want to arrange by placed in the parentheses. Give it a try (Note that first you’ll need to use the pipe operator %>% to pipe the results of the previous function to the next operation.)
Plots
If we want to plot this first we need to allocate the above to an object. You do this putting the name you want to give the object on the left followed by <- which means “is”, followed by the operations to produce the table. So, in effect, all we do is copy the above code and put something like lad_summary <- in front of it.
lad_summary <- london_data %>%
group_by(lad20nm, lad20cd) %>%
summarise(
average_distance = mean(average_distance)
) %>%
ungroup() %>%
mutate(
lad20nm = forcats::fct_reorder(lad20nm, desc(average_distance))
) %>%
arrange(average_distance)There is a trick we can do to make plotting look nicer. Try adding the lines below to your code above (remember to pipe %>% from the last line of your code above)
ungroup() %>%
mutate(
lad20nm = forcats::fct_reorder(lad20nm, desc(average_distance))
) %>%
arrange(average_distance)You’ll see we now have something called lad_summary on the right. Now we can use this to plot.
Bar chart
To plot, we use ggplot(). The basic format to plot a bar chart would be:
ggplot(data, aes(x, y)) +
geom_col()Using this format, have a go at plotting the percentage for each local authority
ggplot(lad_summary, aes(average_distance, lad20nm)) +
geom_col()ggplot works by adding layers. You’ll see above we’ve already done this by adding geom_col() on a new line. There a few other things we can do to make the plot looks nicer.
- Rename the axis labels.
- To do this, use
xlab("name of your choice")andylab"name of your choice")
- To do this, use
- Give the plot a title
- To do this, use
ggtitle("name of your choice")
- To do this, use
- Change the ‘theme’.
- To this, you could try
theme_minimal()
- To this, you could try
Try adding these components to the plot.
How does this way of looking at the data compare to just looking at the table?
Maps
Another cool way we can visualise data is using maps.
First we get the geospatial data. This details the boundaries of each LAD.
# load london geometry
lon_coords <- sf::read_sf("../data/london_coords.shp")We then merge the coordinate data with the summary data we’ve calculated for each local authority.
lad_summary_coords <- lad_summary %>%
merge(., lon_coords[,c('lad20cd', 'geometry')], by.x = "lad20cd", by.y = "lad20cd") %>%
st_as_sf(.) %>%
st_transform(., crs = '+proj=longlat +datum=WGS84') We can then use a custom function to plot the data. The function is called plot_map and you need to specify the data and the variable that you want to plot
What is the data? And what is the variable? Have a go at plotting it below.
plot_map(lad_summary_coords, "average_distance")What are your takeaways from this? Does it build more of a picture than the plot or table alone?
Task
Setting the scene
We now have an idea of how London boroughs perform in terms of how accessible their green space is. However, a problem with just using the average distance is that it might not capture the variation within a borough.
In Tower Hamlets for example there might be some areas with good access to green space and some areas with bad access, but the average cancels that out. Let’s take a quick look at Tower Hamlets as an example
london_data %>%
filter(lad20nm == "Tower Hamlets") %>%
ggplot(., aes(average_distance)) +
geom_histogram() Here we can see that actually there are many areas in Tower Hamlets with good access to green space, and few areas with quite bad access. So the average distance to green space doesn’t tell the whole story.
So, what would be an alternative? Have a think about it…
Instead of calculating the average distance to green space, we could instead count the number of LSOAs within a LAD that have good access. This is also makes sense because the level of LSOA is more relevant to where people live than the the level of LAD.
Your task now is to produce a plot and a map, as we have above, but now exploring the number of LSOAs within each LAD that are a “good” distance away.
First we need to define what is a good distance. Have a think about it and decide. What makes sense? How might we decide?
When you’re happy with a good distance, create a variable to record it below. Call it ‘good_distance’.
good_distance <- 400Now that we’ve decided, for each LAD we need to count how many LSOAs have an average distance that is below the good distance threshold.
We can do this by creating a new variable in london_data called hit. For each row in the data, we work out whether the value of average_distance is less than good_distance. If it is, we set hit equal to 1. If it isn’t, we set hit equal to 0.
We’ll achieve this using a for loop. A for loop iterates across a range of values, each time performing a function.
To illustrate the for loop, take a look at this:
for(i in 1:10){
print(i)
}[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Here we have iterated from 1 to 10, each time printing the number. In our case, we want to iterate through every row in london_data.We can use the function nrow to iterate through all the rows in london_data. We won’t actually do this now because there are 4835 rows! But the point is that we have a way of peforming a function on every row of the data in turn.
Take a look at the code below. Can you tell what’s happening?
for(i in 1:nrow(london_data)){
if(london_data[i, "average_distance"] < good_distance){
london_data[i, "hit"] <- 1
}
else{
london_data[i, "hit"] <- 0
}
}View(london_data)When we look at london_data now, we can see we have a new column called hit.
Has our for loop done what we wanted?
london_data %>%
group_by(lad20nm, lad20cd) %>%
summarise(
count = sum(hit)
) %>%
arrange(desc(count))# A tibble: 33 × 3
# Groups: lad20nm [33]
lad20nm lad20cd count
<chr> <chr> <dbl>
1 Barnet E09000003 167
2 Ealing E09000009 164
3 Bromley E09000006 162
4 Croydon E09000008 161
5 Southwark E09000028 156
6 Enfield E09000010 144
7 Lewisham E09000023 143
8 Lambeth E09000022 137
9 Tower Hamlets E09000030 137
10 Hackney E09000012 136
# ℹ 23 more rows
Note that above we add desc() to arrange. This is because we want to see the best areas first. In this case the best areas have a higher value of count, because these areas have more LSOAs with good distances to green space.
What is the problem with just using a count of LSOAs?
What is a solution to this? How can we express the count data in a different way so that we can compare LADs?
lad_summary <- london_data %>%
group_by(lad20nm, lad20cd) %>%
summarise(
count = sum(hit),
total = n()
) %>%
mutate(
percentage = 100 * (count / total)
) %>%
arrange(desc(percentage))Here we can see that City of London scores very well on this metric.
Have a think about how you’d interpret this (clue: what does the total number of LSOAs suggest?)
By extension of the above, what else might be relevant to how well LADs score according to our metric? What do you think about this ranking? Does it feel like it makes sense?
Plot
Your task now is to explore this new metric in whatever way you want. As a start, you might want to try plotting the same things we’ve done above but for the percentage variable instead of average_distance.
The data you’ll want to use as a starting point is lad_summary
Bar chart
Tip: To plot, we use ggplot(). The basic format to plot a bar chart would be:
ggplot(data, aes(x, y))
+ geom_col()Remember also the other things you can add to the plot to tidy it up
Map
Tip: To plot a map, use the function plot_map
plot_map(data, "variable")